In-depth review: a guide to machine learning for biologists
Overview
Over the past few decades, the scale and complexity of biological datasets have grown enormously, and machine learning is increasingly used to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; the specific methods, however, are numerous and can seem bewildering at first glance. How should one choose a particular machine learning technique for a given type of biological data?
In September 2021, the review "A guide to machine learning for biologists", published in Nature Reviews Molecular Cell Biology, gave readers a concise introduction to key machine learning techniques: both traditional methods such as classification, regression and clustering models, and recently developed and widely used techniques involving deep neural networks. The review also records best practices and points of entry, and looks ahead to some of the most exciting prospects for machine learning applied to biology.
Research areas: machine learning, artificial neural networks, biological data
Joe G. Greener, Shaun M. Kandathil et al. | Authors
Zhao Yuting | Translator
Chen Sixin | Proofreader
Deng Yixue | Editor
Paper title:
A guide to machine learning for biologists
Paper link: https://www.nature.com/articles/s41580-021-00407-0
Contents
Introduction
Key concepts
Traditional machine learning
Artificial neural networks
Challenges for biological applications
Key concepts
Box 1 | Doing machine learning
This box outlines the steps to take when training a machine learning model. Surprisingly little guidance exists on model selection and the training process [146,147], and descriptions of stepping stones and failed models are rarely included in published research articles. Before touching any machine learning code, the first step should be a thorough understanding of the data at hand (the inputs) and the prediction task (the outputs). This means a biological understanding of the problem: for example, knowing where the data come from and what the sources of noise are, and knowing how, on biological grounds, the outputs could in principle be predicted from the inputs. For instance, one can reason that different amino acids may have preferences for particular secondary structures in proteins, so it makes sense to predict secondary structure from the amino acid frequencies at each position of a protein sequence. It is also important to understand how the inputs and outputs are stored computationally. Have they been normalized to prevent one feature from having an outsized influence on predictions? Are they encoded as binary variables or continuously? Are there duplicate entries? Are any data elements missing?
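As a concrete illustration of the encoding questions above, categorical inputs such as secondary-structure classes can be one-hot encoded. This is a minimal NumPy sketch; the three class labels (H, E, C) and the toy sequence are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical secondary-structure classes: H = helix, E = strand, C = coil
CLASSES = ["H", "E", "C"]

def one_hot(label, classes=CLASSES):
    """Return a vector with 1 at the label's position and 0 elsewhere (cf. Fig. 2b)."""
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

# Encode a toy per-residue label string into a (length x classes) matrix
encoded = np.stack([one_hot(s) for s in "HHEC"])
```

Each row sums to one, so a model sees exactly one active class per position; continuous inputs (Fig. 2c) would instead be stored directly as numbers, ideally normalized.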
Next, the data should be split for training, validation and testing. There are several ways to do this, two of which are shown in Fig. 2a. The training set is used directly to update the parameters of the model being trained. The validation set, typically around 10% of the available data, is used to monitor training, to select hyperparameters and to keep the model from overfitting the training data. k-fold cross-validation is commonly used: the training set is divided into k evenly sized partitions (for example, five or ten) to form k different training and validation sets, and performance is compared across partitions to select the best hyperparameters. The test set, sometimes called the hold-out set, also typically around 10% of the available data, is used to assess the model's performance on data not used for training or validation (that is, to estimate its expected real-world performance). The test set should be used only once at the end of a study, or as infrequently as possible [27,38], to avoid the model being fitted to the test set. See the section on data leakage for issues to consider when assembling a fair test set.
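The splitting scheme just described can be sketched in a few lines. The 80/10/10 proportions follow the roughly 10% guideline in the text, and the "data" here are just index arrays standing in for real samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
indices = rng.permutation(n)

# Roughly 80/10/10 split into test, validation and training sets (cf. Fig. 2a)
test_idx = indices[:10]
val_idx = indices[10:20]
train_idx = indices[20:]

def k_fold(train_indices, k=5):
    """Split the training indices into k folds; each fold serves once as validation."""
    folds = np.array_split(train_indices, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

splits = list(k_fold(train_idx, k=5))
```

The held-out `test_idx` never enters any fold, which is the point: it is touched only once, at the end of the study.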
The next step is model selection, which depends on the nature of the data and the prediction task, as summarized in Fig. 1. The training set is then used to train the model, following best practice for the software framework being used. Most methods have hyperparameters that need tuning for best performance; this can be done with a random search or a grid search, and can be combined with the k-fold cross-validation outlined above [27]. Researchers should also consider ensembling, in which the outputs of several similar models are simply averaged, as a relatively reliable way to improve the overall accuracy of a modelling task. Finally, the accuracy of the model should be assessed on the test set (see above).
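A minimal sketch of hyperparameter tuning by grid search combined with k-fold cross-validation, using closed-form ridge regression on synthetic data as the model to be tuned. The grid values, the data and the true weight vector are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=80)

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(lam, k=5):
    """Mean validation error of ridge regression across k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

# Grid search over the regularization strength, a hyperparameter
grid = [0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=cv_error)
```

A random search works the same way, simply sampling hyperparameter values instead of enumerating a grid; either way the test set plays no role in the choice.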
Figure 1. Choosing and training a machine learning method. The overall procedure for training a machine learning method is shown at the top. Below it is a decision tree to help researchers choose a model. The flowchart is intended as a visual guide linking the concepts outlined in this review; a simple overview like this cannot cover every case. For example, the number of data points required for machine learning to become applicable depends on the number of features available per data point (the more features, the more data points are needed) and on the model being used. There are also deep learning models that can work with unlabelled data.
Figure 2. Training a machine learning method. (a) The available data are typically split into training, validation and test sets. The training set is used directly to train the model, the validation set to monitor training and the test set to assess the model's performance. k-fold cross-validation with a test set can also be used. (b) One-hot encoding is a common way to represent categorical inputs, allowing a single choice among several possibilities (here, three possible protein secondary-structure classes). The encoding is a vector of three numbers, all equal to 0 except for the occupied class, which is set to 1; this vector is what the machine learning model uses. (c) Continuous encoding represents numerical inputs, in this case the red, green and blue (RGB) values of a pixel in an image. The result is again a vector of three numbers, corresponding to the amounts of red, green and blue in the pixel. (d) Failing to learn the underlying relationship between variables is called underfitting, whereas learning the noise in the training data is called overfitting. Underfitting can arise from using a model that is not complex enough to describe the signal; overfitting can arise from using a model with too many parameters, or from continuing to train after the true relationship between the variables has been learned. (e) The learning rate determines how quickly the learned parameters are adjusted when training a neural network or some traditional models such as gradient boosting. A low learning rate leads to slow training, which costs time and considerable computing power; conversely, a high learning rate can lead to rapid convergence to a suboptimal solution and poor model performance. (f) Early stopping is the practice of terminating training once the loss function on the validation set starts to increase, even if the loss on the training set is still decreasing. Early stopping helps prevent overfitting.
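The early-stopping rule of panel (f) can be sketched as follows. The loss curves are synthetic, and the `patience` parameter (how many non-improving epochs to tolerate) is an assumption for illustration.

```python
# Synthetic loss curves: training loss keeps decreasing, while validation loss
# decreases and then rises, the classic signature of overfitting (Fig. 2d, f).
train_loss = [1.0 / (t + 1) for t in range(50)]
val_loss = [(t - 10) ** 2 / 100 + 0.1 for t in range(50)]

def early_stop(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once it has
    failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

stop_epoch = early_stop(val_loss)
```

Here the validation loss bottoms out at epoch 10, so training would be halted shortly after and the epoch-10 model kept, even though the training loss is still falling.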
Traditional machine learning
Table 1. Comparison of different machine learning methods
Figure 3. Traditional machine learning methods. (a) Regression finds a relationship between a dependent variable (an observed property) and one or more independent variables (features); for example, predicting a person's weight from their height. (b) A support vector machine (SVM) transforms the original input data so that, in the transformed version (called a latent representation), data belonging to different classes are separated by as wide a gap as possible. Shown here is a prediction of whether a protein is ordered or disordered; the horizontal and vertical axes represent dimensions of the transformed data. (c) Gradient boosting uses an ensemble of weak predictive models, usually decision trees, to make predictions. For example, active drugs can be predicted from molecular descriptors such as molecular weight and the presence of particular chemical groups. The individual predictors are combined in a stagewise fashion to make the final prediction. (d) Principal component analysis (PCA) finds a series of feature combinations that best describe the data while remaining orthogonal to one another; it is commonly used for dimensionality reduction. In the height-and-weight example, the first principal component (PC1), a linear combination of height and weight, captures their strong positive correlation, while PC2 may describe other variables not strongly correlated with height and weight, such as percentage body fat or muscle mass. (e) Clustering uses one of a variety of algorithms to group similar objects, for example grouping cell types by gene expression profile.
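The PCA of panel (d) can be reproduced on simulated height-and-weight data via the singular value decomposition. The numbers below are made up and only meant to show a dominant first principal component.

```python
import numpy as np

# Simulate correlated height (cm) and weight (kg) data; purely illustrative values
rng = np.random.default_rng(2)
height = rng.normal(170, 10, size=200)
weight = 0.9 * height - 80 + rng.normal(0, 3, size=200)
X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)  # PCA operates on centred data

# The principal components are the right singular vectors of the centred data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)  # fraction of variance per component
```

Because height and weight are strongly correlated, PC1 captures well over 90% of the variance, which is why projecting onto it is a useful dimensionality reduction.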
Artificial neural networks
Figure 4. Neural network methods. (a) A multilayer perceptron consists of nodes (shown as circles) representing numbers: input values, output values or internal (hidden) values. The nodes are arranged in layers, with connections, corresponding to learned parameters, between every node in one layer and every node in the next. For example, molecular properties can be used to predict drug toxicity, because the prediction can be made from some complex combination of the independent input features (the molecular properties). (b) A convolutional neural network (CNN) uses filters that move across the input layer to compute the values in the next layer. The same filter is run across the whole layer, meaning that parameters are shared and similar entities can be detected regardless of position. A 2D CNN is shown operating on a microscopy image, but 1D and 3D CNNs also have biological applications. Dimensionality here refers to the spatial dimensions of the data, to which the connectivity inside the CNN is matched: a biological sequence can be treated as 1D, for example, whereas magnetic resonance imaging data can be treated as 3D. (c) A recurrent neural network (RNN) processes each part of a sequential input with the same learned parameters, producing for each input an output and an updated hidden state; the hidden state carries information from earlier parts of the sequence. In this example, the probability that each base in a DNA sequence binds a transcription factor is predicted. The RNN is shown unrolled to make clear how each output is produced using the same layers; this should not be mistaken for the use of different layers. (d) A graph convolutional network uses information from connected nodes in a graph (for example, a protein-protein interaction network) to update the properties of each node by combining contributions from all of its neighbours. The updated node properties form the next layer of the network, and the desired properties are predicted at the output layer. (e) An autoencoder consists of an encoder neural network, which converts the input into a low-dimensional latent representation, and a decoder neural network, which converts that latent representation back into the original input form. For example, protein sequences can be encoded and the latent representation used to generate new protein sequences. In the example shown, 4 of the 5 residues are identical to the input after encoding and decoding by the autoencoder, corresponding to 80% accuracy on that sequence.
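As a sketch of panel (a), a forward pass through a tiny multilayer perceptron can be written in NumPy alone. The weights are random rather than learned (in practice they would be fitted by backpropagation), and the four "molecular descriptors" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    # Common hidden-layer non-linearity
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes the output into (0, 1), usable as a probability
    return 1.0 / (1.0 + np.exp(-x))

W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # input layer (4 features) -> hidden (3 nodes)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden layer -> single output node

def mlp(x):
    """Forward pass: every input node feeds every hidden node, and so on."""
    hidden = relu(x @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

# Four hypothetical molecular descriptors for one compound -> predicted toxicity probability
prob = mlp(np.array([0.2, 1.3, -0.7, 0.5]))
```

The dense `x @ W1` products are exactly the all-to-all connections drawn between layers in the figure; CNNs and RNNs differ mainly in how those parameters are shared across positions.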
Challenges for biological applications
Table 2. Recommended machine learning strategies for different types of biological data
Box 2 | Assessing articles that use machine learning
Below are some questions to consider when reading or reviewing an article that applies machine learning to biological data. Even when they cannot all be answered definitively, it is useful to keep them in mind, and they can serve as a basis for discussion with collaborators who have a machine learning background. A surprising number of articles fail to meet these standards [148].
Is the dataset adequately described?
Researchers should give the complete procedure used to assemble the dataset, ideally with the dataset or summary data permanently archived on the web. In our experience, a thorough description of the machine learning method paired with a vague description of the data is a red flag. If a standard dataset, or one from another study, is used, this should be stated clearly in the article.
Is the test set sound?
Following the discussion in the section on challenges for biological applications, check whether the test set is adequate for benchmarking the property under investigation. There should be no data leakage between the training and test sets, the test set should be large enough to give reliable results, and it should cover the range of likely uses of the tool. Researchers should likewise discuss the composition and size of the training and test sets in detail. The onus is on the authors to ensure that every step has been taken to avoid data leakage, and these steps and the rationale behind them should be described in the article. Journal editors and reviewers should also make sure these tasks have been carried out to a good standard, rather than relying on the authors alone.
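One common way to guard against the leakage described above is to split by group rather than by individual example, so that, for instance, homologous proteins never straddle the train/test divide. A minimal sketch, with made-up cluster and sequence names:

```python
import numpy as np

# Hypothetical homology clusters: every sequence in a cluster is similar to the
# others, so the whole cluster must land on one side of the split.
clusters = {
    "kinase_fam": ["seqA", "seqB", "seqC"],
    "globin_fam": ["seqD", "seqE"],
    "barrel_fam": ["seqF", "seqG", "seqH"],
}

rng = np.random.default_rng(4)
names = list(clusters)
rng.shuffle(names)
test_clusters = set(names[:1])  # hold out whole clusters, not single sequences

train = [s for c in clusters if c not in test_clusters for s in clusters[c]]
test = [s for c in test_clusters for s in clusters[c]]
```

For real protein data the clusters would come from a sequence-clustering tool run at a chosen identity threshold; the principle is the same for cell types, patients or any other grouping that induces similarity.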
Is the choice of model justified?
Researchers should justify their choice of machine learning method. Neural networks should be used because they suit the data and the problem at hand, not merely because everyone else is using them. Discussion of models that were tried but did not work should be encouraged, as it can help other researchers; too often, a complex model is published without any discussion of the unavoidable trial and error that led to it.
Is the method compared with other approaches?
A new method should be compared with existing methods that perform well and are widely used. Ideally, approaches using a variety of model types should be compared, as this helps with interpreting the results. Surprisingly many complex models can in fact be matched in performance by simple regression methods.
Are the results too good to be true?
Claims of more than 99% accuracy are not uncommon in machine learning articles in biology. Usually this is a sign that something has gone wrong in testing, rather than an astonishing breakthrough. Both authors and reviewers should be alert to this.
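A quick illustration of why a very high accuracy can mislead: on an imbalanced test set, a predictor that always outputs the majority class scores 99.5% accuracy while detecting nothing. The labels below are synthetic.

```python
# Synthetic imbalanced test set: 995 negatives, 5 positives
y_true = [0] * 995 + [1] * 5
y_pred = [0] * 1000  # a useless "always negative" predictor

# Accuracy looks excellent, yet the positive class is never found
accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 5
```

Metrics that account for class balance, such as recall, precision or Matthews correlation, expose the failure that raw accuracy hides.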
Is the method available?
At a minimum, researchers who want to use the trained model from an article should be able to run a prediction, whether through a web server or a code file. Ideally, the source code and the trained model should be made available, at a permanent URL and under a permissive licence [149,150]. Making the training code available is also desirable, as it further improves the reproducibility of the article and allows other researchers to build on the method without starting from scratch. Journals should take some responsibility for making this the norm.
Future directions
References
1. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
2. Mitchell, T. M. Machine Learning (McGraw Hill, 1997).
3. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
4. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
5. Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
6. Myszczynska, M. A. et al. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nat. Rev. Neurol. 16, 440–456 (2020).
7. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
8. Tarca, A. L., Carey, V. J., Chen, X.-W., Romero, R. & Drăghici, S. Machine learning and its applications to biology. PLoS Comput. Biol. 3, e116 (2007). This is an introduction to machine learning concepts and applications in biology with a focus on traditional machine learning methods.
9. Silva, J. C. F., Teixeira, R. M., Silva, F. F., Brommonschenkel, S. H. & Fontes, E. P. B. Machine learning approaches and their current application in plant molecular biology: a systematic review. Plant Sci. 284, 37–47 (2019).
10. Kandoi, G., Acencio, M. L. & Lemke, N. Prediction of druggable proteins using machine learning and systems biology: a mini-review. Front. Physiol. 6, 366 (2015).
11. Marblestone, A. H., Wayne, G. & Kording, K. P. Toward an integration of deep learning and neuroscience. Front. Comput. Neurosci. 10, 94 (2016).
12. Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).
13. Buchan, D. W. A. & Jones, D. T. The PSIPRED Protein Analysis Workbench: 20 years on. Nucleic Acids Res. 47, W402–W407 (2019).
14. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
15. Altman, N. & Krzywinski, M. Clustering. Nat. Methods 14, 545–546 (2017).
16. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
17. Zhang, Z. et al. Predicting folding free energy changes upon single point mutations. Bioinformatics 28, 664–671 (2012).
18. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
19. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
20. Blaom, A. D. et al. MLJ: a Julia package for composable machine learning. J. Open Source Softw. 5, 2704 (2020).
21. Jones, D. T. Setting the standards for machine learning in biology. Nat. Rev. Mol. Cell Biol. 20, 659–660 (2019).
22. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
23. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020). Technology company DeepMind entered the CASP13 assessment in protein structure prediction and its method using deep learning was the most accurate of the methods entered.
24. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
25. Tegunov, D. & Cramer, P. Real-time cryo-electron microscopy data preprocessing with Warp. Nat. Methods 16, 1146–1152 (2019).
26. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015). This is a review of deep learning by some of the major figures in the deep learning revolution.
27. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd edn (Springer Science & Business Media, 2009).
28. Adebayo, J. et al. Sanity checks for saliency maps. NeurIPS https://arxiv.org/abs/1810.03292 (2018).
29. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. ICML 48, 1050–1059 (2016).
30. Smith, A. M. et al. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. BMC Bioinformatics 21, 119 (2020).
31. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
32. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67, 301–320 (2005).
33. Noble, W. S. What is a support vector machine? Nat. Biotechnol. 24, 1565–1567 (2006).
34. Ben-Hur, A. & Weston, J. A user's guide to support vector machines. Methods Mol. Biol. 609, 223–239 (2010).
35. Ben-Hur, A., Ong, C. S., Sonnenburg, S., Schölkopf, B. & Rätsch, G. Support vector machines and kernels for computational biology. PLoS Comput. Biol. 4, e1000173 (2008). This is an introduction to SVMs with a focus on biological data and prediction tasks.
36. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
37. Driscoll, M. K. et al. Robust and automated detection of subcellular morphological motifs in 3D microscopy images. Nat. Methods 16, 1037–1044 (2019).
38. Bzdok, D., Krzywinski, M. & Altman, N. Machine learning: supervised methods. Nat. Methods 15, 5–6 (2018).
39. Wang, C. & Zhang, Y. Improving scoring-docking-screening powers of protein-ligand scoring functions using random forest. J. Comput. Chem. 38, 169–177 (2017).
40. Zeng, W., Wu, M. & Jiang, R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genomics 19, 84 (2018).
41. Olson, R. S., Cava, W. L., Mustahsan, Z., Varik, A. & Moore, J. H. Data-driven advice for applying machine learning to bioinformatics problems. Pac. Symp. Biocomput. 23, 192–203 (2018).
42. Rappoport, N. & Shamir, R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 47, 1044 (2019).
43. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
44. Jain, A. K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31, 651–666 (2010).
45. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD'96 Proc. Second Int. Conf. Knowl. Discov. Data Mining 96, 226–231 (1996).
46. Nguyen, L. H. & Holmes, S. Ten quick tips for effective dimensionality reduction. PLoS Comput. Biol. 15, e1006907 (2019).
47. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
48. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
49. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019). This article provides a discussion and tips for using t-SNE as a dimensionality reduction technique on single-cell transcriptomics data.
50. Crick, F. The recent excitement about neural networks. Nature 337, 129–132 (1989).
51. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020). This article discusses a common problem in deep learning called 'shortcut learning', where the model uses decision rules that do not transfer to real-world data.
52. Qian, N. & Sejnowski, T. J. Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202, 865–884 (1988).
53. deFigueiredo, R. J. et al. Neural-network-based classification of cognitively normal, demented, Alzheimer disease and vascular dementia from single photon emission with computed tomography image data from brain. Proc. Natl Acad. Sci. USA 92, 5530–5534 (1995).
54. Mayr, A., Klambauer, G., Unterthiner, T. & Hochreiter, S. DeepTox: toxicity prediction using deep learning. Front. Environ. Sci. 3, 80 (2016).
55. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
56. Xu, J., Mcpartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).
57. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
58. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020).
59. Zeng, H., Edwards, M. D., Liu, G. & Gifford, D. K. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics 32, i121–i127 (2016).
60. Yao, R., Qian, J. & Huang, Q. Deep-learning with synthetic data enables automated picking of cryo-EM particle images of biological macromolecules. Bioinformatics 36, 1252–1259 (2020).
61. Si, D. et al. Deep learning to predict protein backbone structure from high-resolution cryo-EM density maps. Sci. Rep. 10, 4282 (2020).
62. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158–164 (2018).
63. AlQuraishi, M. End-to-end differentiable learning of protein structure. Cell Syst. 8, 292–301.e3 (2019).
64. Heffernan, R., Yang, Y., Paliwal, K. & Zhou, Y. Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33, 2842–2849 (2017).
65. Müller, A. T., Hiss, J. A. & Schneider, G. Recurrent neural network model for constructive peptide design. J. Chem. Inf. Model. 58, 472–479 (2018).
66. Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. JMLR Workshop Conf. Proc. 56, 301–318 (2016).
67. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
68. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
69. Vaswani, A. et al. Attention is all you need. arXiv https://arxiv.org/abs/1706.03762 (2017).
70. Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv https://arxiv.org/abs/2007.06225 (2020).
71. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
72. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. arXiv https://arxiv.org/abs/1806.01261 (2018).
73. Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 181, 475–483 (2020). In this work, a deep learning model predicts antibiotic activity, with one candidate showing broad-spectrum antibiotic activities in mice.
74. Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
75. Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411.e4 (2020).
76. Gligorijevic, V. et al. Structure-based function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
77. Zitnik, M., Agrawal, M. & Leskovec, J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34, i457–i466 (2018).
78. Veselkov, K. et al. HyperFoods: machine intelligent mapping of cancer-beating molecules in foods. Sci. Rep. 9, 9237 (2019).
79. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. arXiv https://arxiv.org/abs/1903.02428 (2019).
80. Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
81. Wang, Y. et al. Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks. Sci. Rep. 6, 19598 (2016).
82. Linder, J., Bogard, N., Rosenberg, A. B. & Seelig, G. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst. 11, 49–62.e16 (2020).
83. Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).
84. Wang, J. et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat. Commun. 12, 1882 (2021).
85. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019).
86. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation 265–283 (USENIX, 2016).
87. Wei, Q. & Dunbrack, R. L. Jr The role of balanced training and testing data sets for binary classifiers in bioinformatics. PLoS ONE 8, e67863 (2013).
88. Walsh, I., Pollastri, G. & Tosatto, S. C. E. Correct machine learning on protein sequences: a peer-reviewing perspective. Brief. Bioinform. 17, 831–840 (2016). This article discusses how peer reviewers can assess machine learning methods in biology, and by extension how scientists can design and conduct such studies properly.
89. Schreiber, J., Singh, R., Bilmes, J. & Noble, W. S. A pitfall for machine learning methods aiming to predict across cell types. Genome Biol. 21, 282 (2020).
90. Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
91. Söding, J. & Remmert, M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr. Opin. Struct. Biol. 21, 404–411 (2011).
92. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
93. Sillitoe, I. et al. CATH: expanding the horizons of structure-based functional annotations for genome sequences. Nucleic Acids Res. 47, D280–D284 (2019).
94. Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol. 10, e1003926 (2014).
95. Li, Y. & Yang, J. Structural and sequence similarity makes a significant impact on machine-learning-based scoring functions for protein-ligand interactions. J. Chem. Inf. Model. 57, 1007–1012 (2017).
96. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).
97. Szegedy, C. et al. Intriguing properties of neural networks. arXiv https://arxiv.org/abs/1312.6199 (2014).
98. Hie, B., Cho, H. & Berger, B. Realizing private and practical pharmacological collaboration. Science 362, 347–350 (2018).
99. Beaulieu-Jones, B. K. et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ. Cardiovasc. Qual. Outcomes 12, e005122 (2019).
100. Konečný, J., Brendan McMahan, H., Ramage, D. & Richtárik, P. Federated optimization: distributed machine learning for on-device intelligence. arXiv https://arxiv.org/abs/1610.02527 (2016).
101. Pérez, A., Martínez-Rosell, G. & De Fabritiis, G. Simulations meet machine learning in structural biology. Curr. Opin. Struct. Biol. 49, 139–144 (2018).
102. Noé, F., Olsson, S., Köhler, J. & Wu, H. Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science 365, 6457 (2019).
103. Shrikumar, A., Greenside, P. & Kundaje, A. Reverse-complement parameter sharing improves deep learning models for genomics. bioRxiv https://www.biorxiv.org/content/10.1101/103663v1 (2017).
104. Lopez, R., Gayoso, A. & Yosef, N. Enhancing scientific discoveries in molecular biology with deep generative models. Mol. Syst. Biol. 16, e9198 (2020).
105. Anishchenko, I., Chidyausiku, T. M., Ovchinnikov, S., Pellock, S. J. & Baker, D. De novo protein design by deep network hallucination. bioRxiv https://doi.org/10.1101/2020.07.22.211482 (2020).
106. Innes, M. et al. A differentiable programming system to bridge machine learning and scientific computing. arXiv https://arxiv.org/abs/1907.07587 (2019).
107. Ingraham, J., Riesselman, A. J., Sander, C. & Marks, D. S. Learning protein structure with a differentiable simulator. ICLR https://openreview.net/forum?id=Byg3y3C9Km (2019).
108. Jumper, J. M., Faruk, N. F., Freed, K. F. & Sosnick, T. R. Trajectory-based training enables protein simulations with accurate folding and Boltzmann ensembles in cpu-hours. PLoS Comput. Biol. 14, e1006578 (2018).
109. Wang, Y., Fass, J. & Chodera, J. D. End-to-end differentiable molecular mechanics force field construction. arXiv http://arxiv.org/abs/2010.01196 (2020).
110. Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. GitHub http://github.com/google/jax (2018).
111. Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019). This work provides a software library based on PyTorch providing functionality for biological sequences.
112. Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).
113. Schoenholz, S. S. & Cubuk, E. D. JAX, M.D.: end-to-end differentiable, hardware accelerated, molecular dynamics in pure Python. arXiv https://arxiv.org/abs/1912.04232 (2019).
114. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
115. Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2020).
116. Livesey, B. J. & Marsh, J. A. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380 (2020).
117. AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics 20, 311 (2019).
118. Townshend, R. J. L. et al. ATOM3D: tasks on molecules in three dimensions. arXiv https://arxiv.org/abs/2012.04035 (2020).
119. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
120. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP) – round XIII. Proteins 87, 1011–1020 (2019).
121. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
122. Munro, D. & Singh, M. DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction. Bioinformatics 36, 5322–5329 (2020).
123. Haario, H. & Taavitsainen, V.-M. Combining soft and hard modelling in chemical kinetic models. Chemom. Intell. Lab. Syst. 44, 77–98 (1998).
124. Cozzetto, D., Minneci, F., Currant, H. & Jones, D. T. FFPred 3: feature-based function prediction for all gene ontology domains. Sci. Rep. 6, 31865 (2016).
125. Nugent, T. & Jones, D. T. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics 10, 159 (2009).
126. Bao, L., Zhou, M. & Cui, Y. nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res. 33, W480–W482 (2005).
127. Li, W., Yin, Y., Quan, X. & Zhang, H. Gene expression value prediction based on XGBoost algorithm. Front. Genet. 10, 1077 (2019).
128. Zhang, Y. & Skolnick, J. SPICKER: a clustering approach to identify near-native protein folds. J. Comput. Chem. 30, 865–871 (2004).
129. Teodoro, M. L., Phillips, G. N. Jr & Kavraki, L. E. Understanding protein flexibility through dimensionality reduction. J. Comput. Biol. 10, 617–634 (2003).
130. Schlichtkrull, M. et al. Modeling relational data with graph convolutional networks. arXiv https://arxiv.org/abs/1703.06103 (2019).
131. Pandarinath, C. et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nat. Methods 15, 805–815 (2018).
132. Antczak, M., Michaelis, M. & Wass, M. N. Environmental conditions shape the nature of a minimal bacterial genome. Nat. Commun. 10, 3100 (2019).
133. Sun, T., Zhou, B., Lai, L. & Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 18, 277 (2017).
134. Hiranuma, N. et al. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat. Commun. 12, 1340 (2021).
135. Pagès, G., Charmettant, B. & Grudinin, S. Protein model quality assessment using 3D oriented convolutional neural networks. Bioinformatics 35, 3313–3319 (2019).
136. Pires, D. E. V., Ascher, D. B. & Blundell, T. L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res. 42, W314–W319 (2014).
137. Yuan, Y. & Bar-Joseph, Z. Deep learning for inferring gene relationships from single-cell expression data. Proc. Natl Acad. Sci. USA 116, 27151–27158 (2019).
138. Chen, L., Cai, C., Chen, V. & Lu, X. Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinformatics 17, S9 (2016).
139. Kantz, E. D., Tiwari, S., Watrous, J. D., Cheng, S. & Jain, M. Deep neural networks for classification of LC-MS spectral peaks. Anal. Chem. 91, 12407–12413 (2019).
140. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
141. Liebal, U. W., Phan, A. N. T., Sudhakar, M., Raman, K. & Blank, L. M. Machine learning applications for mass spectrometry-based metabolomics. Metabolites 10, 243 (2020).
142. Zhong, E. D., Bepler, T., Berger, B. & Davis, J. H. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nat. Methods 18, 176–185 (2021).
143. Schmauch, B. et al. A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).
144. Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623 (2021).
145. Gligorijevic, V., Barot, M. & Bonneau, R. deepNF: deep network fusion for protein function prediction. Bioinformatics 34, 3873–3881 (2018).
146. Karpathy, A. A recipe for training neural networks. https://karpathy.github.io/2019/04/25/recipe (2019).
147. Bengio, Y. Practical recommendations for gradient-based training of deep architectures. Lecture Notes Comput. Sci. 7700, 437–478 (2012).
148. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 3, 199–217 (2021). This study assesses 62 machine learning studies that analyse medical images for COVID-19 and none is found to be of clinical use, indicating the difficulties of training a useful model.
149. List, M., Ebert, P. & Albrecht, F. Ten simple rules for developing usable software in computational biology. PLoS Comput. Biol. 13, e1005265 (2017).
150. Sonnenburg, S., Braun, M. L., Ong, C. S. & Bengio, S. The need for open source software in machine learning. J. Mach. Learn. Res. 8, 2443–2466 (2007).